The client is a meal delivery company that operates in multiple cities, with fulfillment centers in these cities for dispatching meal orders to customers. The client wants help with demand forecasting for upcoming weeks so that the centers can plan their stock of raw materials accordingly.
Most raw materials are replenished on a weekly basis, and since they are perishable, procurement planning is of utmost importance. Accurate demand forecasts also help with staffing the centers. Given the following information, the task is to predict demand for the next 10 weeks (weeks 146-155) for the center-meal combinations in the test set.
Weekly demand data (train.csv): contains the historical demand data for all centers. test.csv contains the same features except the target variable.
Variable Definition
id Unique ID
week Week No
center_id Unique ID for fulfillment center
meal_id Unique ID for Meal
checkout_price Final price including discount, taxes & delivery charges
base_price Base price of the meal
emailer_for_promotion Emailer sent for promotion of meal
homepage_featured Meal featured at homepage
num_orders (Target) Orders Count
fulfilment_center_info.csv: Contains information for each fulfilment center
Variable Definition
center_id Unique ID for fulfillment center
city_code Unique code for city
region_code Unique code for region
center_type Anonymized center type
op_area Area of operation (in km^2)
meal_info.csv: Contains information for each meal being served
Variable Definition
meal_id Unique ID for the meal
category Type of meal (beverages/snacks/soups….)
cuisine Meal cuisine (Indian/Italian/…)
# Importing Libraries
import numpy as np # linear algebra
import pandas as pd # data processing
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings('ignore')
# Importing Raw files
train_raw = pd.read_csv('input/train.csv')
test_raw = pd.read_csv('input/test.csv')
meal = pd.read_csv('input/meal_info.csv')
centerinfo = pd.read_csv('input/fulfilment_center_info.csv')
print("The Shape of Demand dataset :",train_raw.shape)
print("The Shape of Fulfilment Center Information dataset :",centerinfo.shape)
print("The Shape of Meal information dataset :",meal.shape)
print("The Shape of Test dataset :",test_raw.shape)
Check the content of each file
train_raw.head()
centerinfo.head()
meal.head()
test_raw.head()
Check for missing values
train_raw.isnull().sum().sum()
test_raw.isnull().sum().sum()
Observation: No missing values found in the train or test data.
print("The company has", centerinfo["center_id"].nunique(), "fulfillment centers",
      "spread across", centerinfo["city_code"].nunique(), "cities and",
      centerinfo["region_code"].nunique(), "regions")
print("The company serves", meal["meal_id"].nunique(), "unique meals, divided into",
      meal["category"].nunique(), "categories and", meal["cuisine"].nunique(), "cuisines")
#Merge train data with meal and center info
train = pd.merge(train_raw, meal, on="meal_id", how="left")
train = pd.merge(train, centerinfo, on="center_id", how="left")
print("Shape of train data : ", train.shape)
train.head()
#Merge test data with meal and center info
test = pd.merge(test_raw, meal, on="meal_id", how="left")
test = pd.merge(test, centerinfo, on="center_id", how="left")
print("Shape of test data : ", test.shape)
test.head()
# Typecasting to Assign appropriate data type to variables
col_names=['center_id','meal_id','category','cuisine','city_code','region_code','center_type']
train[col_names] = train[col_names].astype('category')
test[col_names] = test[col_names].astype('category')
print("Train Datatype\n",train.dtypes)
print("Test Datatype\n",test.dtypes)
# Orders by centers
center_orders = train.groupby("center_id", as_index=False)["num_orders"].sum()
center_orders = center_orders.sort_values(by="num_orders", ascending=False).head(10)
fig=px.bar(x=center_orders["center_id"].astype("str"),y=center_orders["num_orders"],
title="Top 10 Centers by Order",labels={"x":"center_id","y":"num_orders"})
fig.show()
Observation: Center 13 has the most orders, followed by centers 43 and 10.
#Pie chart on food category
category_counts = train["category"].value_counts()
fig = px.pie(values=category_counts.values, names=category_counts.index,
             title="Most popular food category")
fig.show()
Observation: Beverages are by far the most popular food category.
# Orders by Cuisine type
cuisine_orders = train.groupby("cuisine", as_index=False)["num_orders"].sum()
cuisine_orders = cuisine_orders.sort_values(by="num_orders", ascending=False)
fig = px.bar(cuisine_orders, x="cuisine", y="num_orders", title="Orders by cuisine")
fig.show()
Observation: Italian is the most popular cuisine.
# Impact of checkout price on orders
train_sample = train.sample(frac=0.2, random_state=42)
fig = px.scatter(train_sample, x="checkout_price", y="num_orders",
                 title="Orders vs. checkout price")
fig.show()
sns.boxplot(x=train["checkout_price"])
plt.show()
Observation: Orders are concentrated at lower checkout prices, and the boxplot shows a tail of high-price outliers.
# orders weekly trend
week_orders = train.groupby("week", as_index=False)["num_orders"].sum()
fig = px.line(week_orders, x="week", y="num_orders",
markers=True,title="Order weekly trend")
fig.show()
Observation :
There is no clear week-over-week seasonality; weekly orders stay in the same range with modest ups and downs.
An outlier appears at week 62, which has the lowest total orders, but it does not look like a data entry mistake.
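One way to check the "no clear seasonality" reading beyond eyeballing the line chart is to compare the weekly series against its rolling mean and flag weeks that deviate strongly from it. A minimal sketch with a synthetic stand-in for the week_orders frame built above (the notebook would use the real frame):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the week_orders frame (weeks 1-145)
rng = np.random.default_rng(0)
week_orders = pd.DataFrame({
    "week": np.arange(1, 146),
    "num_orders": 800_000 + rng.integers(-50_000, 50_000, size=145),
})

# A 4-week centered rolling mean smooths short-term noise;
# a roughly flat rolling mean supports the "no clear seasonality" reading
week_orders["rolling_mean"] = (
    week_orders["num_orders"].rolling(window=4, center=True).mean()
)

# Flag weeks whose orders deviate more than 3 std from the rolling mean
resid = week_orders["num_orders"] - week_orders["rolling_mean"]
outlier_weeks = week_orders.loc[resid.abs() > 3 * resid.std(), "week"]
print(outlier_weeks.tolist())
```

On the real data, week 62 would be expected to show up in this list.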
From the given data, we derived the features below to improve model performance.
Discount Percent: the percentage discount offered to the customer.
Discount Y/N: whether a discount was provided - 1 if there is a discount and 0 if there is not.
#Discount Percent
train['discount percent'] = ((train['base_price']-train['checkout_price'])/train['base_price'])*100
#Discount Y/N
train['discount y/n'] = [1 if x>0 else 0 for x in (train['base_price']-train['checkout_price'])]
# Creating same feature in test dataset
test['discount percent'] = ((test['base_price']-test['checkout_price'])/test['base_price'])*100
test['discount y/n'] = [1 if x>0 else 0 for x in (test['base_price']-test['checkout_price'])]
train.head(2)
# Check for correlation between numeric features
plt.figure(figsize=(13,13))
sns.heatmap(train.corr(numeric_only=True),linewidths=.1,cmap='Reds',annot=True)
plt.title('Correlation Matrix')
plt.show()
Observation: base_price and checkout_price are highly correlated (checkout_price is base_price adjusted for discounts and charges), so base_price is dropped before modeling to avoid redundancy.
The model evaluation metric is 100 * RMSLE, where RMSLE is the Root Mean Squared Logarithmic Error across all entries in the test set.
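The metric can be reproduced directly from its definition. A small sketch (with made-up numbers) cross-checking a hand-rolled RMSLE against scikit-learn's mean_squared_log_error:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle(y_true, y_pred):
    """RMSLE = sqrt(mean((log1p(pred) - log1p(true))^2))."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Toy order counts, not from the competition data
y_true = np.array([120.0, 45.0, 300.0, 10.0])
y_pred = np.array([100.0, 50.0, 280.0, 15.0])

score = 100 * rmsle(y_true, y_pred)          # competition scoring: 100 * RMSLE
reference = 100 * np.sqrt(mean_squared_log_error(y_true, y_pred))
assert np.isclose(score, reference)
```

Because the error is taken on log1p-transformed values, the metric penalizes relative rather than absolute errors, which suits order counts spanning several magnitudes.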
#Define One hot encoding function
def one_hot_encode(features_to_encode, dataset):
    # sparse_output replaces the deprecated sparse argument (sklearn >= 1.2);
    # get_feature_names_out replaces the deprecated get_feature_names
    encoder = OneHotEncoder(sparse_output=False)
    encoder.fit(dataset[features_to_encode])
    encoded_cols = pd.DataFrame(
        encoder.transform(dataset[features_to_encode]),
        columns=encoder.get_feature_names_out(features_to_encode),
        index=dataset.index,
    )
    dataset = dataset.drop(columns=features_to_encode)
    return pd.concat([dataset, encoded_cols], axis=1)
# Get the list of categorical variables in the dataset
ls = train.select_dtypes(include='category').columns.values.tolist()
# Run one-hot encoding on all categorical variables
features_to_encode = ls
data = one_hot_encode(features_to_encode, train)
data = data.reset_index(drop = True)
# Train-Validation Data Split
y = data["num_orders"]  # Series, to avoid a DataConversionWarning when fitting
X = data.drop(["num_orders", "id", "base_price", "discount y/n"], axis=1)
X = X.replace((np.inf, -np.inf, np.nan), 0)  # replace NaN and infinity values with 0
# 20% of train data is used for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=100)
# Prepare test data after applying one-hot encoding
OH_test = one_hot_encode(features_to_encode, test)
test_final = OH_test.drop(["id", "base_price", "discount y/n"], axis=1)
# Align test columns with the training design matrix, since the encoder was fit separately
test_final = test_final.reindex(columns=X.columns, fill_value=0)
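Because the encoder is fit separately on train and test, the resulting dummy columns can differ in membership or order, and the model expects exactly the training columns. A toy illustration (hypothetical frames, not the notebook's actual data) of aligning the test matrix with reindex:

```python
import pandas as pd

# Toy stand-ins for the encoded train (X) and test matrices;
# "cuisine_Thai" appears only in train, "cuisine_Continental" only in test
X = pd.DataFrame({"checkout_price": [100.0, 150.0],
                  "cuisine_Indian": [1, 0],
                  "cuisine_Thai": [0, 1]})
test_final = pd.DataFrame({"checkout_price": [120.0],
                           "cuisine_Continental": [1],
                           "cuisine_Indian": [0]})

# Drop columns unseen in training, add missing ones as 0,
# and match the training column order
test_aligned = test_final.reindex(columns=X.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

Without this step, a category present in only one of the two files would make predict() fail or silently misalign features.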
#create pipeline for scaling and modeling
RF_pipe = make_pipeline(StandardScaler(),RandomForestRegressor(n_estimators = 100,max_depth = 7))
#Build Model
RF_pipe.fit(X_train, y_train)
# Predict on the validation set
RF_val_y_pred = RF_pipe.predict(X_val)
# Model Evaluation
print('R Square:', RF_pipe.score(X_val, y_val))
print('RMSLE:', 100*np.sqrt(metrics.mean_squared_log_error(y_val, RF_val_y_pred)))
test_y_pred = RF_pipe.predict(test_final)
Submission = pd.DataFrame(columns=['id', 'num_orders'])
Submission['id'] = test['id']
Submission['num_orders'] = test_y_pred
Submission.to_csv('output/submission_rfr.csv', index=False)
print(Submission.shape)
Submission.head()